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Abstract 

We study algorithms for the fast computation of modular inverses. 
Newton-Raphson iteration over p-adic numbers gives a recurrence relation 
computing modular inverse modulo p m , that is logarithmic in m. We 
solve the recurrence to obtain an explicit formula for the inverse. Then we 
study different implementation variants of this iteration and show that our 
explicit formula is interesting for small exponent values but slower or large 
exponent, say of more than 700 bits. Overall we thus propose a hybrid 
combination of our explicit formula and the best asymptotic variants. 
This hybrid combination yields then a constant factor improvement, also 
for large exponents. 

1 Introduction 

The multiplicative inverse modulo a prime power is fundamental for the arith- 
metic of finite rings (see e.g. [1] and references therein). It is also used for 
instance to compute homology groups in algebraic topology for image pattern 
recognition [4] , mainly to improve the running time of algorithms working mod- 
ulo prime powers. Those can be used for the computation of the local Smith 
normal form [5, 6], for instance in the context of algebraic topology [4, algorithm 
LRE]. 

Classical algorithms to compute a modular inverse uses the extended Eu- 
clidean algorithm and Newton-Raphson iteration over p-adic fields, namely 
Hensel lifting [8]. Arazi and Qi in [1] lists also some variants adapted to the 
binary characteristic case that cut the result in lower and higher bits. 

In the following, we give another proof of Arazi and Qi's logarithmic formula 
using Hensel lifting. Then we derive an explicit formula for the inverse that 
generalizes to any prime power. Finally, we study the respective performance 
of the different algorithms both asymptotically and in practice and introduce a 
hybrid algorithm combining the best approaches. 
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2 Hensel's lemma modulo p m 

For the sake of completeness, we first give here Hensel's lemma and its proof 
from Newton-Raphson's iteration (see e.g. [2, Theorem 7.7.1] or [3, §4.2] and 
references therein). 

Lemma 1 (Hensel). Let p be a prime number, m 6 N, / <E 1i[X] and r G Z 
such that f{r) — mod p m . If f'(r) ^ mod p m and 

t = JM f{r) - K 



Ira 



then s = r + tp rn satisfies f{s) = mod p 

Proof. Taylor expansion gives that f(r+tp m ) = f(r)+tp m f (r) + 0(p 2m ). Thus 
if t = —^rf'(r)^ 1 , the above equation becomes f(s) — mod p 2m . □ 



m 



3 Inverse modulo 2 

Now, in the spirit of [7], we apply this lemma to the inverse function 

Fa{x) = — - 1 (1) 

ax 

3.1 Arazi and Qi's formula 

We denote by an under-script r, (resp. h) the lower (resp. higher) part in binary 
format for an integer. From Equation (1) and Lemma 1 modulo 2 4 , if r = a -1 
mod 2 l , then we immediately get 

^ _ ax / 

2 l \ ax 2 

In other words t = 1 ~f r mod 2*. Now let a = b + 2 l ajf mod 2 2% so that we 
also have r = b^ 1 mod 2* and hence rb = 1 + 2 l a with < a < 2 l . Thus 
ar = br + 2Va# = 1 + 2*(a + ran) which shows that 

t = -(a + ra H )r = - {{rb) H + (ra H ) L )r mod 2' (2) 

The latter is exactly [1, Theorem 1] and yields the following Algorithm 1, where 
the lower and higher parts of integers are obtained via masking and shifting. 

Lemma 2. Algorithm 1 requires 13|log 2 (m)J + 1 arithmetic operations. 
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Algorithm 1 Arazi&Qi Quadratic Modular inverse modulo 2 



Input: a € Z odd and m 6 A. 
Output: U = a- X mod 2 m . 



1 


U = l; 




2 


for (?' = 1; i < m; i <<= 1) do 




3 


& = a&(2 i -l); 


{6 = a mod 2 1 } 


4 


ti = U *b; ti »= i; 


{(r&M 


5 


c= (a » i) &(2 4 - 1); 




6 


*2 = (f/*c) &(2* - 1); 


{(ra H ) L } 


7 


tx + = t 2 ; 




8 


= (2 1 - 1); 


{-0 


9 






10 


*i <<= i; 




11 


U\ = h; 


{r + i2 4 } 


12 


end for 




13 


U & = (2 m - 1); 


{r mod 2 m } 


14 


return {/ ; 





3.2 Recurrence formula 

Another view of Newton- Raphson's iteration is to create a recurrence. Equation 
(1) gives 

U n +i = U n - " 1 =U n - (aU n - l)U n 
= U n {2-aU n ) 

This yields the loop of Algorithm 2, for the computation of the inverse, see 
e.g. [8] or [3, §2.4]. 

Lemma 3. Algorithm 2 is correct and requires 6[log 2 (m)] +2 arithmetic oper- 
ations. 

Proof. The proof of correctness is natural in view of the Hensel lifting. First 
Uq = a -1 mod p. Second, by induction, suppose a ■ U n = 1 modp k . Then 
aU n = 1+Xp k anda(/„ + i = aU n (2-aU n ) = (1+Xp k ){2-1-Xp k ) = (1-X 2 p 2k = 
1 mod p 2k ). Finally U n = a -1 mod p 2 " . □ 

Remark 1. We present this algorithm for computations modulo p m but its opti- 
mization modulo a power of 2 is straightforward: replace the modular operations 
of for instance lines 4> 6 etc. by a binary masking: x & = (2 1 — 1). 

Remark 2. It is important to use a squaring in line 3. Indeed squaring can 
be faster than multiplication, in particular in the arbitrary precision setting [9J. 
In the case of algorithm 2, the improvement over an algorithm of the form 
temp = 2 — a * U; temp% = p rn ; U* = temp; U% = p m ; is of about 30%. 
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Algorithm 2 Hensel Quadratic Modular inverse 



Input: a € Z odd, p is a prime and m € iV . 
Output: [/ = a -1 mod p m . 



1 


U = a ^ mod p; 


{extended gcd} 


2 


for (i = 2; i < z <<C= 1) do 




q 

a 


temp — U ^ u; 


\ U n) 


4 


temp % = p 1 ; 


{temp mod p l { 
{aU 2 n } 


5 


temp * = a; 


6 


temp % = p 1 ; 


{temp mod p 1 } 


7 


U «= 1; 


{2U n } 


8 


U — = temp; 


{U n (2-aU n )} 


9 


end for 




10 


temp — U *U; 




11 


temp % = p m ; 


{temp mod p m } 


12 


temp * = a; 




13 


temp % = p m ; 


{temp mod p 2 } 


14 


U «= 1; 


{2C/„} 


15 


U — — temp; 


{U n (2-aU n )} 
{U modp 2 ™} 


16 


U% = p m ; 


17 


return U ; 





Remark 3. Note that for algorithms 1 and 2, a large part of the computation 
occur during the last iteration of the loop when 2 % is closest to 2 m . Therefore, 
a recursive version cutting in halves will be more efficient in practice since the 
latter will be exactly done at i = m/2 instead of at the largest power of 2 lower 
than m. Moreover this improvement will take place at each recursion level. We 
thus give in the following the recursive version for formula (3), the one for a 
recursive version of Arazi&Qi is in the same spirit. 

3.3 Factorized formula 

We now give an explicit formula for the inverse by solving the preceding recur- 
rence relation, first in even characteristic. 

We denote by H n = aU n a new sequence, that satisfies H n+ i = H n {2 — H n ). 
With Ho = a we get Hi = a{2 — a) = 2a — a 2 = 1 — (a — l) 2 , by induction, 
supposing that H n = 1 — (a — l) 2 ,we get 

H n+1 = (l-(a- l) 2 ") (2 - 1 + (a - l) 2 ") 

= l 2 -((a-l) 2 ") 2 = l-(a-l) 2 " +1 

Using the remarkable identity, this in turn yields H n = a{2— a) fL =1 
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Algorithm 2' Recursive Hensel 
Input: a € Z odd, p is a prime and m £ N . 
Output: r = a -1 mod p m . 
1: if m == 1 then return a -1 mod p; end if {ext. gcd} 

h = [— ] 

6 = a&(2 /l -l); {i = a mod2 il } 

r =RecursiveHensel(6, h); 

temp = r * r; {f 2 } 
temp % = p , {temp mod p h } 

temp * = a; {ar 2 } 
temp % = p ; {temp mod p' 1 } 

r «= 1; {2r} 
r — = temp; { r (2 — ar )} 

r % = p m ; {r mod p 2 } 



return r; 



therefore, with Uq — 1 and U\ = 2 — a we have that 

^ = (2-a)n(l + («-l) 2 *) (4) 
The latter equation gives immediately rise to the following algorithm 3. 

Algorithm 3 Explicit Quadratic Modular inverse modulo 2 m 
Input: a € Z odd and m E N. 
Output: [/-Ea- 1 mod2 m . 
1 

2 
3 
1 



Let s and £ be such that a = 2 s t + 1; 
U = 2 -a; 
amone = a — 1; 

for (i = 1; i < ™; i <<= 1) do 

amone * = amone; {square: (a — l) 2 } 

amone & = (2 m - 1); {(a - l) 2 * mod 2 m } 

U * = (amone + 1); 

[/ & = (2 m - 1); {U mod 2 m } 

end for 
return U ; 



Lemma 4. Algorithms 3 is correct and requires 5|log 2 (Y)J + 2 arithmetic op- 
erations. 

Proof. Modulo 2 m , a is invertible if and only if a is odd, so that a — 2 s t + 1 and 
therefore, using Formula (4), we get aU n = H n = 1 - (a - l) 2 " = 1 - (2 s i) 2 " = 1 
mod 2 s2 ". Thus, f7fi og2 (ffl)l mod 2 m = a" 1 mod 2 m . □ 

There are two major points to remark with this variant: 
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1. It makes less operations than previous algorithms. 



2. It must compute with the full p-adic development (modulo operations are 
made modulo 2 m and not 2\ 

Therefore we will see that this algorithm has a worse asymptotic complexity 
but is very efficient in practice for small exponents. 

3.4 Generalization modulo any prime power 

The formula generalizes directly for any prime power: 

Theorem 1. Let p be a prime number, a coprime to p and b = a -1 mod p is 
the inverse of a modulo p. Let also V n be the following sequence: 



Then V n = a 1 mod p 2 . 

Proof. The proof is similar to that of Lemma 3 and follows also from Hensel's 
lemma. From the analogue of Equation (4), we have a ■ V n = 1 — (ab — l) 2 . 
Now as a ■ b = 1 + Xp, by the definition of b we have a ■ V n = 1 — (ab — l) 2 = 



4 Experimental comparisons 

The point of the classical Newton-Raphson algorithms (as well as Arazi and 
Qi's variant) is that it works with modular computations of increasing sizes, 
whereas the explicit formula requires to work modulo the highest size from the 
beginning. On the hand we show next that this gives an asymptotic advantage 
the recurring relations. On the other hand, in practice, the explicit formula 
enables much faster performance for say cryptographic sizes. 

4.1 Over word-size integers 

Using word-size integers, the many masking and shifting required by recurring 
relations do penalize the performance, where the simpler Algorithm 3 is on 
average 26% faster on a standard desktop PC, as shown on Figure 1. Differently, 
Arazi and Qi's variant suffers from the manipulations required to extract the 
low and high parts of integers. 

4.2 Over arbitrary precision arithmetic 

We first provide the equivalents of the complexity results of the previous section 
but now for arbitrary precision: the associated binary complexity bounds for 
the different algorithms supports then the asymptotic analysis in the beginning 
of this section. 




(5) 



1 — (Ap) 2 " = 1 mod p 2 " . 



□ 
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Inverse modulo 2 m on a 64 bits Intel Xeon W3530, 2.80GHz 
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power of two exponent (long int) 

Figure 1: Modular inverse on 64 bits machine words 



60 



Lemma 5. Using classical arithmetic, Algorithm 1 requires 

O (2m 2 + 10m) binary operations. 

Proof. We suppose that masking and shifting as well as addition are linear and 
that multiplication is quadratic (O (2m 2 ) operations to multiply to elements of 

size m). Then the complexity bound becomes O (m + ^J^M-i 3 . 2 (2 j ) 2 + 1(2 3 ') 

O(2m 2 + 10m). □ 

Lemma 6. Using classical arithmetic in even characteristic modulo 2 m , Algo- 
rithm 2 requires 

O (^ m2 + 9m^ binary operations. 

Proof. We suppose that masking and shifting as well as addition are linear 
and that multiplication is quadratic. Then the complexity bound becomes 



(2 • 2m 2 + 5m + £jf| (m)_1 2 • 2(2^) 2 + 4(2*)) = O 



[f m 2 



9m). 



□ 



Lemma 7. Using classical arithmetic, Algorithm 3 requires 

O ((4m 2 + 2m) Llog 2 (m)J) binary operations. 

Proof. Similarly, here we have O ^°ii (m)_1 2 • 2m 2 + 2m) = O ((4m 2 + 2m) log 2 (m)) . 



□ 
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Thus, we see that the explicit formula adds a logarithmic factor, asymptot- 
ically. In practice, Figure 2 shows that using GMP 1 the asymptotic behavior 
of Algorithm 1 becomes predominant only for integers with more than 1200 
bits. For the Newton-Raphson iteration the asymptotic behavior of Algorithm 
2 becomes predominant even sooner, for integers with about 640 bits. Below 
that size, Algorithm 3 is better. 




Figure 2: Modular inverse on arbitrary precision integers 



Remark 4. Now, in [8, 7], the recurrence relation from (3), is extended? to 
X n+ i = 1 ~( 1 ~^ x ™'> j or a fixed r. This allows a faster convergence, by a factor 
oflog 2 (r). Unfortunately the price to pay is to compute a r-th power at each 
iteration (instead of a single square), which could be done, say by recursive 
binary squaring, but at a price o/log 2 (r) squarings and between and log 2 (r) 
multiplications. Overall there would be no improvement in the running time. 



4.3 Hybrid algorithm 

With the threshold of Figure 2 and from the previous algorithms, we can then 
construct the hybrid algorithm 4, which is better than all of them everywhere. 
For small exponents it uses the explicit formula, then for larger exponents it 
uses the classical Hensel formula and switches to Arazi&Qi formula only for 
exponents larger than 9000 bits. It switches back to Hensel formula after 10 6 

x http : //gmplib . org 

2 this is to be compared with explicit formula (4), V n +\ = V n (l + (1 — ab) 2 ), where the 
computation is done with the first inverse modp (recall that b = a~ 1 mod p), where in the 
classical setting the computation is done with the inverse so far: X n 
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bits: indeed on the used computer quasi linear multiplication via FFT comes 
into play in GMP and the analysis of section 4.2 is not relevant anymore. 



Algorithm 4 Hybrid Modular inverse 
Input: a G Z odd, to € N. 
Output: u = a -1 mod 2 m . 

1: if to < 640 then 

2: return Explicit(a, m); 

3: else 

4: Leth=\f]; 

5: Let b = a mod 2 h ; 

6: r =Hybrid(6,2,/i); 

7: if (to < 9000) or (to > 10 6 ) then 

8: return u = 2r — br 2 mod 2 m ; 

9: else 

10: return u = r + t2 h mod 2 m ; 
11: end if 
12: end if 



Finally, Algorithm 4 is on average 21% times faster than any other direct 
lifting alone as shown with the curve (4) of Figure 2 (recall that on Figure 2 
ordinates are presented in a logarithmic scale) and also on the ratios of Figure 3. 



5 Conclusion 

We have studied different variants of Newton-Raphson's iteration over p-adic 
numbers to compute the inverse modulo a prime power. We have shown that a 
new explicit formula can be up 26% times faster in practice than the recursive 
variants for small exponents. Asymptotically, though, the latter formula suffers 
from a supplementary logarithmicfactor in the power (or a doubly logarithmic 
factor in the prime power) that makes it slower for large arbitrary precision 
integers, however, using each one of the best two algorithms in their respective 
regions of efficiency, we were able to make a hybrid algorithm with improved 
performance of 21% on average at any precision. 
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